Assignment 03
1 Companies Table
| | company_name | company_raw | company_is_staffing | company_id |
|---|---|---|---|---|
| 0 | Crowe | Crowe | False | 0 |
| 1 | The Devereux Foundation | The Devereux Foundation | False | 1 |
| 2 | Elder Research | Elder Research | False | 2 |
| 3 | NTT DATA | NTT DATA Inc | False | 3 |
| 4 | Frederick National Laboratory For Cancer Research | Frederick National Laboratory for Cancer Research | False | 4 |
2 Data Preparation (Clean Up Data)
Medians : 87295.0 130042.0 115024.0
Data cleaning complete. Rows retained: 72498
3 Salary Distribution by Industry and Employment Type
| | EMPLOYMENT_TYPE_NAME | SALARY |
|---|---|---|
| 0 | Part-time / full-time | 92500.0 |
| 1 | Full-time (> 32 hours) | 110155.0 |
| 2 | Full-time (> 32 hours) | 92962.0 |
| 3 | Full-time (> 32 hours) | 107645.5 |
| 4 | Full-time (> 32 hours) | 192800.0 |
4 Salary Analysis by ONET Occupation Type (Bubble Chart)
Appendix 1: I asked Copilot for help because my aggregation was not working correctly. The problem turned out to be the combination of the aggregation and the sorting we had done in the Saturday help session. AI prompts are attached.
5 Salary by Education Level (Two Groups)
Create two groups:
Associate’s or lower (GED, Associate, No Education Listed)
Bachelor’s (Bachelor’s degree)
Master’s (Master’s degree)
PhD (PhD, Doctorate, professional degree)
Plot a scatter plot for each group using MAX_YEARS_EXPERIENCE (with jitter), Average_Salary, and LOT_V6_SPECIALIZED_OCCUPATION_NAME.
After each graph, add a short explanation of key insights.
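The grouping listed above could be expressed as a small mapping function. This is only a minimal sketch in plain Python, not the code used in the assignment: the function name `education_group` and the keyword matching are assumptions, and the raw education labels in the dataset may be spelled differently.

```python
def education_group(level: str) -> str:
    """Map a raw education label to one of the groups listed above.

    Hypothetical helper: the keyword checks assume labels similar to
    "Bachelor's degree" or "Ph.D. or professional degree".
    """
    text = level.lower()
    # Check the highest levels first so a label is assigned to exactly one group.
    if any(key in text for key in ("phd", "ph.d", "doctorate", "professional")):
        return "PhD"
    if "master" in text:
        return "Master's"
    if "bachelor" in text:
        return "Bachelor's"
    # GED, Associate, "No Education Listed", and anything unrecognized.
    return "Associate's or lower"


# Example labels (illustrative, not taken from the dataset).
labels = ["GED", "Bachelor's degree", "Master's degree", "Ph.D. or professional degree"]
groups = [education_group(label) for label in labels]
```

In a Spark workflow the same rule would more likely be written as a chained column expression, but the plain function keeps the logic easy to check.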
6 Salary by Education Level (Four Groups)
Create four groups:
Associate’s or lower (GED, Associate, No Education Listed)
Bachelor’s (Bachelor’s degree)
Master’s (Master’s degree)
PhD (PhD, Doctorate, professional degree)
Plot a scatter plot for each group using MAX_YEARS_EXPERIENCE (with jitter), Average_Salary, and LOT_V6_SPECIALIZED_OCCUPATION_NAME.
After each graph, add a short explanation of key insights.
See Appendix 2: I asked AI to help fix the data points falling in a straight line, and it suggested adding jitter.
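Jitter adds a small random offset to each x-value so that points sharing the same whole-number years of experience no longer stack in a vertical line. The sketch below shows the idea in plain Python; the helper name `add_jitter`, the width of 0.3, and the sample values are illustrative assumptions, not the settings used for the plots here.

```python
import random


def add_jitter(values, width=0.3, seed=42):
    """Return a copy of values with uniform noise in [-width, +width] added."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [value + rng.uniform(-width, width) for value in values]


# Hypothetical MAX_YEARS_EXPERIENCE values with many ties.
years = [2, 2, 2, 5, 5, 10]
jittered = add_jitter(years)
```

The width should stay well below the spacing between distinct values (one year here), so the jittered points still read as belonging to their original year.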